Row-by-Row Scanning Systems for IBM Punched Cards as Applied to Information Retrieval Problems

In the art of mechanical information retrieval, one of the basic methods for identifying documents for selection consists of scanning through a set of document record cards. These record cards contain sets of data that characterize the contents of the individual documents. By comparing these data with a set of data characterizing an inquiry, documents may be identified which, by virtue of similarity of the respective characteristics, are likely to be relevant to the inquiry.
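The scanning method just described can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the document identifiers, and the names `RECORDS` and `scan` are hypothetical, and "similarity" is taken here to mean a simple count of shared characteristics, which the report itself does not specify.

```python
# Hypothetical record cards: each lists the characteristics of one document.
RECORDS = {
    "doc-1": {"punched cards", "coding", "retrieval"},
    "doc-2": {"retrieval", "indexing"},
    "doc-3": {"coding", "matrices"},
}


def scan(records, inquiry, min_shared=1):
    """Scan every record and return the ids of those that share at least
    min_shared characteristics with the inquiry (an assumed similarity rule)."""
    return [
        doc_id
        for doc_id, traits in records.items()
        if len(traits & inquiry) >= min_shared
    ]


inquiry = {"coding", "retrieval"}
strict_hits = scan(RECORDS, inquiry, min_shared=2)  # both characteristics required
loose_hits = scan(RECORDS, inquiry, min_shared=1)   # any one characteristic suffices
print(strict_hits)
print(loose_hits)
```

Raising `min_shared` narrows the selection to records that resemble the inquiry more closely, at the cost of missing partially relevant documents.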

In using IBM punched cards as a medium for recording data for retrieval by machine, the system designer has to adjust his methods to the capabilities inherent in these devices. This is particularly true if such systems are to utilize comparatively simple equipment currently available. Such equipment lacks certain functions which are most desirable in processing information for retrieval. One of these is the ability to scan information in serial fashion, column-by-column from left to right on the card, and to perform a reasonable degree of logical operations for the purpose of selection.

This requirement is based on the necessity of enumerating a varying number of characteristics and of having access to them for purposes of comparison no matter where they are located on a record card. However, the mode of operation of the machines in question demands that the columnar location of a desired entry be known beforehand and that, therefore, the entry be assignable to a fixed field.

In order not to forgo the many advantages which punched card systems offer in processing information, a number of compromise methods have been introduced as a substitute for truly serial scanning of alphabetic or numeric punched card records. The one most widely advocated is that of superimposed coding. By this method information is no longer spelled out letter by letter or number by number. Instead, word or number notations are translated into a code of varying hole locations within a suitably large table or matrix. This table is assigned a fixed location on the card. The codes for the various characteristics to be recorded are then all punched into this same field. The result is a pattern of holes representative not only of the originally intended codes but also of spurious new code combinations, many of which may stand for characteristics which do not apply. As a consequence, when comparing for the presence of code combinations representative of the inquiry, the selection may contain false answers. This necessitates manual analysis of the answers to delete the wrong ones. It is true that by appropriate design these occurrences may be minimized, and that in certain applications some degree of such "noise" may be tolerated. But the system has other limitations. Once the entries are made into the field, there is no way of identifying thereafter what combination the code marks were intended to represent. Also, there is no way of relating two or several codes to express interdependence or combination of certain characteristics.
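How superimposed codes give rise to false answers can be shown with a minimal sketch. It assumes a hypothetical six-position field and an invented codebook in which each characteristic is assigned two hole positions; the names `CODEBOOK`, `punch`, and `selects` are illustrative and do not come from the report.

```python
# Hypothetical codebook: each characteristic is assigned two hole positions
# in a six-position field. The assignments are invented for illustration.
CODEBOOK = {
    "copper": frozenset({0, 1}),
    "alloy": frozenset({2, 3}),
    "welding": frozenset({1, 2}),
    "casting": frozenset({4, 5}),
}


def punch(characteristics):
    """Superimpose the codes of all characteristics into one field of holes."""
    holes = set()
    for name in characteristics:
        holes |= CODEBOOK[name]
    return frozenset(holes)


def selects(field, inquiry):
    """A card is selected if every hole of every inquiry code is present."""
    return all(CODEBOOK[name] <= field for name in inquiry)


card = punch(["copper", "alloy"])      # holes 0, 1, 2 and 3 are punched
genuine = selects(card, ["copper"])    # True: this code was actually recorded
spurious = selects(card, ["welding"])  # True: holes 1 and 2 came from two other codes
rejected = selects(card, ["casting"])  # False: holes 4 and 5 are absent
print(sorted(card), genuine, spurious, rejected)
```

The card was never coded for "welding", yet it is selected because the holes of two unrelated codes happen to cover that code. The punched field alone also gives no way to tell which of the matching codes were originally intended, which is the second limitation noted above.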

The row-by-row scanning system to be described here is another substitute for truly serial scanning. However, this system not only avoids the conditions just mentioned, but also offers certain advantages over column-by-column scanning.

By: H. P. Luhn

Published in: RC100 in 1959
